An Exploratory Data and Network Analysis of Movies
Introduction
In this report, we will be analysing a dataset from Kaggle, which contains movies of different genres produced over a vast number of years. What makes this analysis interesting is that we can try to draw various conclusions based on a movie’s popularity, the directors or actors involved, the year of production, and so forth. Moreover, we can construct various networks in an attempt to find meaningful and interesting results. When inspecting a database of films from recent years, various interesting inferences can be uncovered. A film may have a high rating yet a low return on investment (ROI). Which genre would you guess is the most successful? Which actors do you think are the most popular?
We have split our Exploratory Data Analysis into four main parts:
| Section | Description |
|---|---|
| 1 | Introducing the Data - We first try to understand the data and look at its content. |
| 2 | Pre-Processing - We look at what needs to be altered or removed from the dataset, clean any dirty text, and try to minimise the dataset’s missing values. |
| 3 | Exploring the Data - We conduct basic analysis on the dataset, explore genres and movie popularity, look at profit, gross, and return on investment, and conduct more advanced analysis. |
| 4 | Network Analysis - We measure the network (centrality, degree distribution, number of components, average degree) and use these measures to highlight certain nodes (actors) and see which attributes of an actor are associated with higher ratings and budgets. |
Admin
Before we start, let’s keep this code chunk for importing the correct libraries and loading the appropriate dataset. We use pacman to load the following:
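The original chunk is not reproduced here; a minimal sketch, with the package list inferred from the functions used later in this report (not stated explicitly in the original), might look like:

```r
# Install pacman if needed, then load the packages this report relies on.
# The package list is inferred from calls made later: aggr() (VIM),
# ggarrange() (ggpubr), ggcorr() (GGally), geom_text_repel() (ggrepel), etc.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, dplyr, tidyr, stringr, VIM,
               plotly, formattable, ggpubr, ggrepel, GGally)
```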
We import the dataset like this:
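A sketch of the import, assuming the Kaggle CSV sits in a local data directory (the exact filename and path are assumptions, following the relative paths used elsewhere in the report):

```r
# Read the Kaggle movie metadata CSV into a data frame;
# the path "../data/movie_metadata.csv" is an assumption
movie_metadata <- read.csv("../data/movie_metadata.csv",
                           stringsAsFactors = FALSE)
```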
In the next section we can introduce our dataset and look at its content.
Introducing The Dataset
This section of the report is quite essential for our analysis. We cannot make any interesting inferences from the dataset if we do not know what is contained within it. In this section we will try to understand exactly what we are dealing with. Thereafter, we can begin to draw interesting results. We have already read in our dataset called movie_metadata, so we can see the following:
The dataset contains 28 unique columns/variables, each of which is described in the table below:
| Variable Name | Description |
|---|---|
| color | Specifies whether a movie is in black and white or color |
| director_name | Contains name of the director of a movie |
| num_critic_for_reviews | Contains number of critic reviews per movie |
| duration | Contains duration of a movie in minutes |
| director_facebook_likes | Contains number of facebook likes for a director |
| actor_3_facebook_likes | Contains number of facebook likes for actor 3 |
| actor_2_name | Contains name of 2nd leading actor of a movie |
| actor_1_facebook_likes | Contains number of facebook likes for actor 1 |
| gross | Contains the amount a movie grossed in USD |
| genres | Contains the sub-genres to which a movie belongs |
| actor_1_name | Contains name of the actor in lead role |
| movie_title | Title of the Movie |
| num_voted_users | Contains number of users votes for a movie |
| cast_total_facebook_likes | Contains number of facebook likes for the entire cast of a movie |
| actor_3_name | Contains the name of the 3rd leading actor of a movie |
| facenumber_in_poster | Contains number of actors faces on a movie poster |
| plot_keywords | Contains key plot words associated with a movie |
| movie_imdb_link | Contains the link to the imdb movie page |
| num_user_for_reviews | Contains the number of user generated reviews per movie |
| language | Contains the language of a movie |
| country | Contains the name of the country in which a movie was made |
| content_rating | Contains maturity rating of a movie |
| budget | Contains the amount of money spent in production per movie |
| title_year | Contains the year in which a film was released |
| actor_2_facebook_likes | Contains number of facebook likes for actor 2 |
| imdb_score | Contains user generated rating per movie |
| aspect_ratio | Contains the size of the aspect ratio of a movie |
| movie_facebook_likes | Number of likes of the movie on its Facebook Page |
Furthermore, the dataset contains 5043 movies, spanning 96 years across 46 countries. There are 1693 unique directors and 5390 unique actors/actresses. Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries.
The structure of the dataset can also be used to understand our data. We can run the following code chunk to see its structure.
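The chunk itself is not shown above; it is presumably a single call to `str()`:

```r
# Display the structure of the data frame:
# one line per variable with its type and a few sample values
str(movie_metadata)
```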
## 'data.frame': 5043 obs. of 28 variables:
## $ color : chr "Color" "Color" "Color" "Color" ...
## $ director_name : chr "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : chr "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : chr "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
## $ actor_1_name : chr "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : chr "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : chr "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
## $ movie_imdb_link : chr "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : chr "English" "English" "English" "English" ...
## $ country : chr "USA" "USA" "UK" "USA" ...
## $ content_rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ budget : num 2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
In the next section we can start preparing the dataset for analysis by removing or simplifying some of the data.
Pre-Processing Data
In this part of the report we attempt to look for various things that may have a negative or insignificant impact on the inferences we make on the dataset. Once we have sufficiently cleaned and prepared the dataset, we can commence with drawing various conclusions from the graphs we generate.
Duplicate Rows
In movie_metadata, we have some duplicate rows, so we want to remove the 45 duplicated rows and keep the unique ones.
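The chunk that counts and removes the duplicates is omitted above; a sketch of one common approach:

```r
# Count fully duplicated rows (the output below reports 45)
sum(duplicated(movie_metadata))
# Keep only the first occurrence of each row
movie_metadata <- movie_metadata[!duplicated(movie_metadata), ]
```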
## [1] 45
Missing Values
Let’s have a look at the number of NA values in our dataset:
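The counting chunk is not reproduced in the original; a one-line sketch that produces per-column NA counts like those shown below:

```r
# Number of NA values in each column
colSums(is.na(movie_metadata))
```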
## color director_name
## 0 0
## num_critic_for_reviews duration
## 49 15
## director_facebook_likes actor_3_facebook_likes
## 103 23
## actor_2_name actor_1_facebook_likes
## 0 7
## gross genres
## 874 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 1 1
## actor_3_name facenumber_in_poster
## 0 13
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 22 0
## country content_rating
## 0 0
## budget title_year
## 488 108
## actor_2_facebook_likes imdb_score
## 14 1
## aspect_ratio movie_facebook_likes
## 328 1
To help visualise this, have a look at the following heatmap of the missing values:
# Visualise Missing Values
missing.values <- aggr(movie_metadata, sortVars = T, prop = T, sortCombs = T, cex.lab = 1.5, cex.axis = .6, cex.numbers = 5, combined = F, gap = -.2)
##
## Variables sorted by number of missings:
## Variable Count
## gross 0.174869948
## budget 0.097639056
## aspect_ratio 0.065626251
## title_year 0.021608643
## director_facebook_likes 0.020608243
## num_critic_for_reviews 0.009803922
## actor_3_facebook_likes 0.004601841
## num_user_for_reviews 0.004401761
## duration 0.003001200
## actor_2_facebook_likes 0.002801120
## facenumber_in_poster 0.002601040
## actor_1_facebook_likes 0.001400560
## num_voted_users 0.000200080
## cast_total_facebook_likes 0.000200080
## imdb_score 0.000200080
## movie_facebook_likes 0.000200080
## color 0.000000000
## director_name 0.000000000
## actor_2_name 0.000000000
## genres 0.000000000
## actor_1_name 0.000000000
## movie_title 0.000000000
## actor_3_name 0.000000000
## plot_keywords 0.000000000
## movie_imdb_link 0.000000000
## language 0.000000000
## country 0.000000000
## content_rating 0.000000000
Gross and Budget
Since gross and budget have many missing values (874 and 488 respectively), and we want to keep both variables for the following analysis, we delete the rows with missing gross or budget values; with this many gaps, imputation would not do a good job.
# Find NA values for gross and budget
movie_metadata <- movie_metadata[!is.na(movie_metadata$gross), ]
movie_metadata <- movie_metadata[!is.na(movie_metadata$budget), ]
dim(movie_metadata)
## [1] 3857 28
The number of observations has decreased by 4998 - 3857 = 1141, which is luckily only 22.8% of the previous total. Let’s have a look at how many complete cases we have.
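The chunk is not shown in the original; a sketch of the usual check:

```r
# Number of rows with no missing values in any column
sum(complete.cases(movie_metadata))
```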
Content Rating
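The tabulation chunk is omitted above; the output below presumably comes from:

```r
# Frequency of each content rating (note the blank "" category)
table(movie_metadata$content_rating)
```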
##
##           Approved        G       GP        M    NC-17 Not Rated
##        51       17       91        1        2        6        42
##    Passed       PG    PG-13        R  Unrated        X
##         3      573     1314     1723       24       10
According to the history of these content ratings, M and GP correspond to the modern PG, and X corresponds to NC-17. We therefore replace M and GP with PG, and X with NC-17, since these are the labels used nowadays.
movie_metadata$content_rating[movie_metadata$content_rating == 'M'] <- 'PG'
movie_metadata$content_rating[movie_metadata$content_rating == 'GP'] <- 'PG'
movie_metadata$content_rating[movie_metadata$content_rating == 'X'] <- 'NC-17'
We also want to replace Approved, Not Rated, Passed, and Unrated with the most common rating, R.
movie_metadata$content_rating[movie_metadata$content_rating == 'Approved'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Not Rated'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Passed'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Unrated'] <- 'R'
movie_metadata$content_rating <- factor(movie_metadata$content_rating)
table(movie_metadata$content_rating)
##
##              G  NC-17     PG  PG-13      R
##     51     91     16    576   1314   1809
Blanks should be treated as missing values. Since these missing values cannot be replaced with reasonable data, we delete these rows.
Delete (Some) Rows
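The deletion chunk is not reproduced in the original; a sketch, assuming blank strings are first converted to NA before the affected rows are dropped:

```r
# Treat empty strings as missing values
movie_metadata[movie_metadata == ""] <- NA
# Drop rows whose content rating was blank
movie_metadata <- movie_metadata[!is.na(movie_metadata$content_rating), ]
# Re-check the remaining NA counts per column
colSums(is.na(movie_metadata))
```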
## color director_name
## 0 0
## num_critic_for_reviews duration
## 1 0
## director_facebook_likes actor_3_facebook_likes
## 0 6
## actor_2_name actor_1_facebook_likes
## 0 1
## gross genres
## 0 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 0 6
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 0 0
## country content_rating
## 0 0
## budget title_year
## 0 0
## actor_2_facebook_likes imdb_score
## 2 0
## aspect_ratio movie_facebook_likes
## 55 0
We remove aspect_ratio because (1) it has a lot of missing values and (2) we will not be looking into its impact on the other variables (we assume it has none).
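The removal chunk is omitted above; a one-line sketch:

```r
# Drop the aspect_ratio column
movie_metadata <- subset(movie_metadata, select = -c(aspect_ratio))
```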
Add a Column
Gross and Budget
We have gross and budget information, so let’s add two columns, profit and percentage return on investment, for further analysis.
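The chunk itself is not shown; a sketch using the same formulas that appear in the profitability analysis later in the report:

```r
library(dplyr)
# Profit in USD and percentage return on investment
movie_metadata <- movie_metadata %>%
  mutate(profit = gross - budget,
         return_on_investment_perc = (profit / budget) * 100)
```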
Remove (Some) Columns
Colour
Next, we take a look at the influence of colour vs black and white.
##
## Black and White Color
## 2 124 3680
Since only 3.4% of the movies are in black and white, this variable is nearly constant, so we can remove the color column.
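The removal chunk is omitted above; a one-line sketch:

```r
# Drop the near-constant color column
movie_metadata <- subset(movie_metadata, select = -c(color))
```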
Language
Let’s have a look at the different languages contained within the dataset.
##
##            Aboriginal     Arabic    Aramaic    Bosnian  Cantonese
##          2          2          1          1          1          7
## Czech Danish Dari Dutch English Filipino
## 1 3 2 3 3644 1
## French German Hebrew Hindi Hungarian Indonesian
## 34 11 2 5 1 2
## Italian Japanese Kazakh Korean Mandarin Maya
## 7 10 1 5 14 1
## Mongolian None Norwegian Persian Portuguese Romanian
## 1 1 4 3 5 1
## Russian Spanish Thai Vietnamese Zulu
## 1 24 3 1 1
Almost 95% of the movies are in English, which means this variable is nearly constant. Let’s remove it.
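The removal chunk is omitted above; a one-line sketch:

```r
# Drop the near-constant language column
movie_metadata <- subset(movie_metadata, select = -c(language))
```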
Country
Next, we can look at the different types of countries.
##
## Afghanistan Argentina Aruba Australia Belgium
## 1 3 1 40 1
## Brazil Canada Chile China Colombia
## 5 63 1 13 1
## Czech Republic Denmark Finland France Georgia
## 3 9 1 103 1
## Germany Greece Hong Kong Hungary Iceland
## 79 1 13 2 1
## India Indonesia Iran Ireland Israel
## 5 1 4 7 2
## Italy Japan Mexico Netherlands New Line
## 11 15 10 3 1
## New Zealand Norway Official site Peru Philippines
## 11 4 1 1 1
## Poland Romania Russia South Africa South Korea
## 1 2 3 3 8
## Spain Taiwan Thailand UK USA
## 22 2 4 316 3025
## West Germany
## 1
Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries, so we group the other countries together to reduce this categorical variable to three levels: USA, UK, Others.
levels(movie_metadata$country) <- c(levels(movie_metadata$country), "Others")
movie_metadata$country[(movie_metadata$country != 'USA')&(movie_metadata$country != 'UK')] <- 'Others'
movie_metadata$country <- factor(movie_metadata$country)
table(movie_metadata$country)
##
## Others UK USA
## 465 316 3025
Now that we’ve cleaned up our dataset, we can continue to explore our data even further! In the next section we will be looking at genres, movie popularity, gross, profit, and many other aspects pertinent to our data.
Analysing Data
When inspecting a dataset of movies over the past few years, various interesting inferences can be uncovered. A movie may have a high rating yet low return on investment. Which genre is the most successful? Which actors are the most popular? These are some of the questions we aim to answer in this section.
We can start by performing basic analysis on our data. Thereafter, we delve a bit deeper into more specific parts of the dataset, in hopes of uncovering interesting observations.
Basic Analysis
Let’s first have a look at the number of movies that are produced over the years.
ggplot(movie_metadata, aes(title_year)) +
geom_bar() +
labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1916 - 2016)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=c(1980), linetype="dotted")
From the graph, we see there aren’t many records of movies released before 1980. It’s better to remove those records because they might not be representative.
Let’s have a look at the movie counts now:
ggplot(movie_metadata, aes(title_year)) +
geom_bar() +
labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1980 - 2016)") +
theme(plot.title = element_text(hjust = 0.5))
Movie Genre Analysis
Now we can delve into more specific things regarding movies, like genres.
Top Genres
genre = movie_metadata['genres']
genre = data.frame(table(genre))
genre = genre[order(genre$Freq,decreasing=TRUE),]
# Top 20 genres with the most movies
ggplot(genre[1:20,], aes(x=reorder(factor(genre), Freq), y=Freq, alpha=Freq)) +
geom_bar(stat = "identity", fill="blue") +
geom_text(aes(label=Freq),hjust=1.2, size=3.5)+
xlab("Genre") +
ylab("Number of Movies") +
ggtitle("Top 20 genres with the most movies") +
coord_flip()
Split Genres
As you can see, movies can be associated with multiple genres. For analysis purposes, we choose to use the first genre listed in the genres column, as this is likely the most representative description.
## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy"
## [3] "Action|Adventure|Thriller" "Action|Thriller"
## [5] "Action|Adventure|Sci-Fi" "Action|Adventure|Romance"
Let’s split the genres separated by “|” into 8 different columns.
# Split on "|"
genres_split <- str_split(movie_metadata$genres, pattern="[|]", n=2)
# Create Matrix
genres_matrix <- do.call(rbind, strsplit(movie_metadata$genres, '[|]'))
# Dataframe of genres
genres_df <- as.data.frame(genres_matrix)
genres_df consists of 8 columns, each with different genres. Let’s have a look at the frequency of all the genres.
# Collapse all genres into one column
genres_one_col <- gather(genres_df) %>%
select(value)
# Plot frequency of first column
genres_one_col %>%
group_by(value) %>%
tally() %>%
filter(n >= 30) %>%
ggplot() +
geom_bar(aes(x = value, y=n), stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Genre") +
ylab("Frequency") +
NULL
It is evident that the Drama and Comedy genres are the most frequently produced. However, this does not mean that they are the most profitable or yield the most successful ROIs. This will be explored further.
Previously we assumed that the first genre is the most applicable, therefore, we choose the first column as the genre for the movie and append it to the dataframe.
# Take old genres column away
movie_metadata <- subset(movie_metadata, select = -c(genres))
# Take first column of genres_df and add it to MAIN df
movie_metadata$genre <- genres_df$V1
What does this distribution look like over the years? Let’s have a look at the frequency of genres between 1980 and 2016.
movie_metadata %>%
group_by(title_year, genre) %>%
summarise(count = n()) %>%
ggplot(aes(title_year, as.factor(genre))) +
geom_tile(aes(fill=count),colour="white") +
scale_fill_gradient(low="light blue",high = "dark blue") +
xlab("Year of Movie") +
ylab("Genre of Movie") +
ggtitle("Heat Map of Movie Genres Produced Over the Years") +
theme(panel.background = element_blank())
The heat map shows that the popular, high-frequency genres are produced more and more often over the years, as is evident from the darker shades of blue becoming more prominent in the later years.
Which Genres are Popular?
In our dataset we have Facebook Likes and IMDB scores associated with a movie. This can help give us an indication of how popular each genre is.
# Add column for total facebook likes
movie_metadata$total_facebook_likes <- movie_metadata$director_facebook_likes +
movie_metadata$actor_3_facebook_likes + movie_metadata$actor_1_facebook_likes +
movie_metadata$cast_total_facebook_likes + movie_metadata$actor_2_facebook_likes +
movie_metadata$movie_facebook_likes
# creating a data frame containing avg score, avg votes and avg fb likes
score_votes_likes <- movie_metadata %>%
group_by(genre) %>%
summarise(count = n(),
avg_score = round(mean(imdb_score), 1),
avg_votes = mean(num_voted_users),
avg_facebook_likes = mean(total_facebook_likes),
avg_reviews = mean(num_user_for_reviews)) %>%
filter(count > 10)
# arranging data frame by average score
arr_score <- arrange(score_votes_likes, desc(avg_facebook_likes))
as.data.frame(score_votes_likes)
##    genre count avg_score avg_votes avg_facebook_likes avg_reviews
## 1 Action 937 6.3 145049.12 38762.692 462.1942
## 2 Adventure 356 6.5 133549.07 40494.340 355.0365
## 3 Animation 43 6.7 83318.51 30471.674 202.0698
## 4 Biography 200 7.1 100233.72 41000.900 260.4650
## 5 Comedy 997 6.1 64763.60 23896.765 197.9388
## 6 Crime 247 6.9 130843.95 33857.397 362.5789
## 7 Documentary 32 6.8 16159.19 5623.312 126.3750
## 8 Drama 667 6.8 85462.68 31889.940 319.7301
## 9 Fantasy 37 6.3 87311.14 25452.270 374.7027
## 10 Horror 152 5.7 66091.12 21356.158 406.8750
## 11 Mystery 24 6.6 180824.04 35571.625 570.2500
Genres VS IMDB Scores
Observing the graphs below, most genres span a wide range of ratings. However, niche genres such as musical, thriller, romance, and western are more consistent in their ratings.
# genre wise distribution of scores using a custom plot function
facet_hist_plot(movie_metadata$imdb_score,
"Genre",
bwidth = 0.1)+
scale_x_continuous(breaks = seq(0, 9.3, 1))
# plotting 10 highest rated genres
bar_plot(arr_score[1:10,],
arr_score[1:10,]$avg_score,
arr_score[1:10,]$avg_score,
"| Average Score:",
"Genres",
"Average IMDB Score",
"Top 10 IMDB Scores Per Genre",
deci = 1)
The graph above shows that a less frequently produced genre, biography, possesses the highest average ratings.
Genres VS Facebook Likes
Let’s now have a look at the Facebook likes.
# plotting 10 highest rated genres
bar_plot(arr_score[1:10,],
arr_score[1:10,]$avg_facebook_likes,
arr_score[1:10,]$avg_facebook_likes,
"| Average Score:",
"Genres",
"Average Facebook Likes",
"Top 10 Facebook Likes Per Genre",
deci = 1)
Biography also draws the highest number of likes on Facebook. Given this, should directors strive to create more biographies? This will be determined when looking at the ROI.
Popularity Analysis
IMDB ratings VS Movie Count
ggplot(movie_metadata, aes(imdb_score)) +
geom_histogram(bins = 50) +
geom_vline(xintercept = mean(movie_metadata$imdb_score,na.rm = TRUE),colour = "blue") +
ylab("Movie Count") +
xlab("IMDB Rating") +
ggtitle("IMDB Ratings for Movies") +
ggplot2::annotate("text", label = "Mean IMDB rating",x = 6.2, y = 50, size = 3, colour = "yellow",angle=90)
It is evident from the histogram that the majority of movies are rated between 6 and 8 out of 10. Ratings well below this range suggest a movie was poorly received, while ratings well above it indicate an exceptional film.
Popularity over the years
#Creating the required subset of data
movies_pop <- movie_metadata %>%
select(title_year, movie_facebook_likes) %>%
filter(title_year > 1980) %>%
group_by(title_year) %>%
summarise(avg = mean(movie_facebook_likes))
#Generating the popularity Vs time plot
ggplot(movies_pop, aes(x = title_year, y = avg)) +
geom_point() +
geom_smooth() +
geom_vline(xintercept = c(1990,2004),colour = c("orange","blue")) +
ylab("Mean Popularity Score") +
xlab("Years") +
ggplot2::annotate("text", label = "Facebook",x = 2003, y = 80, size = 3, colour = "blue",angle=90)+
ggplot2::annotate("text", label = "IMDB",x = 1989, y = 80, size = 3, colour = "orange",angle=90)
A spike in popularity around 2004 is apparent from the clear rise in popularity score; the creation of Facebook and the growing influence of social media likely drove this.
Facebook Likes VS IMDB Score
movie_metadata %>%
plot_ly(x = ~movie_facebook_likes, y = ~imdb_score, color = ~content_rating , mode = "markers", text = ~content_rating, alpha = 0.7, type = "scatter")
Top 20 directors with highest average IMDB score
Examining the directors with the highest average IMDB scores, the top average rating is 8.6. This suggests that individual movies rated substantially higher are either exceptional films and/or fluke occurrences.
movie_metadata %>%
group_by(director_name) %>%
summarise(avg_imdb = mean(imdb_score)) %>%
arrange(desc(avg_imdb)) %>%
top_n(20, avg_imdb) %>%
formattable(list(avg_imdb = color_bar("orange")), align = 'l')
| director_name | avg_imdb |
|---|---|
| Tony Kaye | 8.600000 |
| Damien Chazelle | 8.500000 |
| Majid Majidi | 8.500000 |
| Ron Fricke | 8.500000 |
| Christopher Nolan | 8.425000 |
| Asghar Farhadi | 8.400000 |
| Marius A. Markevicius | 8.400000 |
| Richard Marquand | 8.400000 |
| Sergio Leone | 8.400000 |
| Lee Unkrich | 8.300000 |
| Lenny Abrahamson | 8.300000 |
| Pete Docter | 8.233333 |
| Hayao Miyazaki | 8.225000 |
| Joshua Oppenheimer | 8.200000 |
| Juan José Campanella | 8.200000 |
| Quentin Tarantino | 8.200000 |
| David Sington | 8.100000 |
| Je-kyu Kang | 8.100000 |
| Terry George | 8.100000 |
| Tim Miller | 8.100000 |
The IMDB rating system started in the 1990s, while social media platforms like Facebook started in the mid-2000s. Facebook’s arrival pushed the number of votes per movie far higher than IMDB alone had.
#Performing operations on Movies Vote Count over the years
movies_vote1 <- movie_metadata %>%
select(title_year, num_voted_users) %>%
group_by(title_year) %>%
summarise(count = sum(num_voted_users))
ggplot(movies_vote1, aes(x = title_year, y = count/1000)) +
geom_bar( stat = "identity") +
geom_vline(xintercept = c(1990,2004),colour = c("orange","blue")) +
ylab("Vote count (in thousands)") +
xlab("Years") +
ggplot2::annotate("text", label = "Facebook",x = 2003, y = 160, size = 3, colour = "blue",angle=90) +
ggplot2::annotate("text", label = "IMDB",x = 1989, y = 160, size = 3, colour = "orange",angle=90)
Vote Counts VS IMDB score
# breaking num_users_voted into 4 buckets
movie_metadata$vote_bucket <- cut(movie_metadata$num_voted_users,
c(0, 50000, 100000, 300000, 500000))
# plotting a boxplot for votes vs imdb_score
bp <- na.omit(movie_metadata) %>%
ggplot(aes(x = vote_bucket, y = imdb_score)) +
geom_boxplot(aes(fill = vote_bucket), alpha = 0.7,
show.legend = F) +
stat_summary(fun.y = mean, geom = "point",
shape = 23) +
coord_flip() +
xlab("User Votes") +
ylab("IMDB Score") +
ggtitle("Box plot for User Votes vs IMDB Score") +
theme_minimal() +
theme(plot.title = element_text(size = 14,
color = "darkgrey",
family = "Times"))
ggplotly(bp) %>%
layout(margin = m,
xaxis = a,
yaxis = a,
legend = list(orientation = "h", size = 4,
bgcolor = "#E2E2E2",
bordercolor = "darkgrey",
borderwidth = 1,
x = 0,
y = -0.3))
# scatter plot for user votes vs imdb score
scatter_plot(movie_metadata$num_voted_users, movie_metadata$imdb_score,
"User Votes",
"IMDB Score",
"Scatter plot for User Votes vs IMDB Score",
"User Votes:",
"<br>IMDB Score:",
"<br>Title:",
alpha = 0.3)
Each line below represents a yearly average: the IMDB score, the number of votes, and the number of user reviews.
# creating a data frame for average score, avg votes and, avg user reviews
# by title year
scat_year_score <- movie_metadata %>%
group_by(title_year) %>%
summarise(count = n(),
IMDB_Score = round(mean(imdb_score),1),
avg_votes = mean(num_voted_users),
avg_user_review = round(mean(num_user_for_reviews)))
# plotting line graph for Avg score by title year
lp1 <- line_graph(scat_year_score,
scat_year_score$IMDB_Score,
"Average Score")
# plotting line graph for Avg votes by title year
lp2 <- line_graph(scat_year_score,
scat_year_score$avg_votes,
"Average Votes")
# plotting line graph for Avg reviews by title year
lp3 <- line_graph(scat_year_score,
scat_year_score$avg_user_review,
"Average User Review")
subplot(lp1, lp2, lp3, nrows = 3, heights = c(0.33, 0.33, 0.33))
Profit | Gross | Return on Investment
budget <- movie_metadata %>%
select(movie_title, budget) %>%
arrange(desc(budget)) %>%
head(15)
x <- ggplot(budget, aes(x = reorder(movie_title, -desc(budget)), y = budget/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Movie Budgets")+
xlab("")+
ylab("Budget (in Millions)") +
coord_flip()
rev <- movie_metadata %>%
select(movie_title, gross) %>%
arrange(desc(gross)) %>%
head(15)
y <- ggplot(rev, aes(x = (reorder(movie_title, -desc(gross))), y = gross/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Movie Gross")+
xlab("")+
ylab("Gross (in Millions)") +
coord_flip()
ggarrange(x, y,
labels = c("A", "B"),
ncol = 1, nrow = 2)
Most Successful Directors based on Profit
#Top 20 most successful directors
movie_metadata %>%
group_by(director_name) %>%
mutate(profit = gross - budget)%>%
select(director_name, budget, gross, profit) %>%
na.omit() %>%
summarise(films = n(), budget = sum(as.numeric(budget)), gross = sum(as.numeric(gross)), profit = sum(as.numeric(profit))) %>%
mutate(avg_per_film = profit/films) %>%
arrange(desc(avg_per_film)) %>%
top_n(20, avg_per_film) %>%
ggplot( aes(x = films, y = avg_per_film/1000000)) +
geom_point(size = 1, color = "blue") +
geom_text_repel(aes(label = director_name), size = 3, color = "blue") +
xlab("Number of Films") + ylab("Avg Profit $millions") +
ggtitle("Most Successful Directors")
Top 20 Movies Based on Profit
#Top 20 movies based on profit
movie_metadata %>%
filter(title_year %in% c(2000:2016)) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
arrange(desc(profit)) %>%
top_n(20, profit) %>%
ggplot(aes(x=budget/1000000, y=profit/1000000)) +
geom_point(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = movie_title), size = 3) +
xlab("Budget $million") +
ylab("Profit $million") +
ggtitle("20 Most Profitable Movies")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Top 20 Movies Based on Return on Investment
#Top 20 movies based on return on investment
movie_metadata %>%
filter(budget >100000) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
arrange(desc(profit)) %>%
top_n(20, profit) %>%
ggplot(aes(x=budget/1000000, y=return_on_investment_perc)) +
geom_point(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = movie_title), size = 3) +
xlab("Budget $million") +
ylab("Percent Return on Investment") +
ggtitle("20 Most Profitable Movies Based on Return on Investment")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Further Analysis
Commercial success Vs Critical acclaim
movie_metadata %>%
top_n(15, profit) %>%
ggplot(aes(x = imdb_score, y = gross/10^6, size = profit/10^6, color = content_rating)) +
geom_point() +
geom_hline(aes(yintercept = 600)) +
geom_vline(aes(xintercept = 7.75)) +
geom_text_repel(aes(label = movie_title), size = 4) +
xlab("Imdb score") +
ylab("Gross money earned in million dollars") +
ggtitle("Commercial success Vs Critical acclaim") +
ggplot2::annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross") +
theme(plot.title = element_text(hjust = 0.5))
Correlation Heatmap
ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size = 2.8, size = 2, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
## Warning in ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size
## = 2.8, : data in column(s) 'director_name', 'actor_2_name', 'actor_1_name',
## 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link',
## 'language', 'country', 'content_rating', 'genre', 'vote_bucket' are not
## numeric and were ignored
Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors.
Given the highest correlation value of 0.95, we find that actor_1_facebook_likes is highly correlated with cast_total_facebook_likes, and the actor 2 and actor 3 likes are also correlated with the total. So we want to consolidate them into two variables: actor_1_facebook_likes and other_actors_facebook_likes.
There are high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews.
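The consolidation chunk is not shown in the original; a sketch consistent with the description above (dropping the redundant total and the individual actor 2/3 columns is an assumption):

```r
library(dplyr)
# Combine the supporting actors' likes into one variable and drop the
# columns they replace, as described in the text above
movie_metadata <- movie_metadata %>%
  mutate(other_actors_facebook_likes = actor_2_facebook_likes +
           actor_3_facebook_likes) %>%
  select(-cast_total_facebook_likes,
         -actor_2_facebook_likes, -actor_3_facebook_likes)
```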
Sentiment Analysis
IMDB <- read.csv("../data/real_deal.csv")
#Removing Duplicates
IMDB <- IMDB[!duplicated(IMDB), ]
#Removing White-Spaces and Special Characters
IMDB$movie_title <- gsub("Â", "", as.character(factor(IMDB$movie_title)))
IMDB$movie_title <- str_trim(IMDB$movie_title, side = "right")
IMDB <- subset(IMDB, select = -c(genres))
colSums(sapply(IMDB, is.na))
## color director_name
## 0 0
## num_critic_for_reviews duration
## 49 15
## director_facebook_likes actor_3_facebook_likes
## 103 23
## actor_2_name actor_1_facebook_likes
## 0 7
## gross actor_1_name
## 874 0
## movie_title num_voted_users
## 0 0
## cast_total_facebook_likes actor_3_name
## 0 0
## facenumber_in_poster plot_keywords
## 13 0
## movie_imdb_link num_user_for_reviews
## 0 21
## language country
## 0 0
## content_rating budget
## 0 487
## title_year actor_2_facebook_likes
## 107 13
## imdb_score aspect_ratio
## 0 327
## movie_facebook_likes
## 0
#Plotting Facebook likes against IMDB score
ggplot(data = IMDB, aes(x = imdb_score, y = movie_facebook_likes)) +
geom_point() +
stat_smooth(method = "lm")
#Plotting Aspect Ratio against Facebook likes
IMDB %>%
filter(aspect_ratio==1.85 | aspect_ratio==2.35) %>%
mutate(aspect_ratio=as.character((aspect_ratio))) %>%
ggplot() +
geom_point(aes(x=aspect_ratio, y=movie_facebook_likes))
#Constructing Top 20
keywords_split <- str_split(IMDB$plot_keywords, pattern="[|]", n=5)
keywords_matrix <- do.call(rbind, strsplit(as.character(IMDB$plot_keywords), "[|]"))
## Warning in (function (..., deparse.level = 1) : number of columns of result
## is not a multiple of vector length (arg 99)
keywords_df <- as.data.frame(keywords_matrix)
names(keywords_df) <- c("one", "two", "three", "four", "five")
keywords_one_col <- gather(keywords_df) %>%
select(value)
keywords_one_col_freq <- keywords_one_col %>%
group_by(value) %>%
tally()
top_20 <- keywords_one_col_freq %>%
select(value, n) %>%
top_n(20)
## Selecting by n
keywords_one_col %>%
group_by(value) %>%
tally() %>%
filter(n > 30) %>%
ggplot() +
geom_bar(aes(x = value, y=n), stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## # A tibble: 20 x 2
## value n
## <chr> <int>
## 1 alien 82
## 2 boy 72
## 3 death 132
## 4 dog 58
## 5 drugs 66
## 6 fbi 71
## 7 female protagonist 58
## 8 friend 165
## 9 friendship 67
## 10 high school 89
## 11 island 57
## 12 love 194
## 13 marriage 60
## 14 money 61
## 15 murder 160
## 16 new york city 92
## 17 police 127
## 18 prison 62
## 19 revenge 70
## 20 school 73
#Placing Top 20 Words Against Movie Success
IMDB_true_false <- IMDB
for (keyword in top_20$value) {
IMDB_true_false <- cbind(IMDB_true_false, ifelse(str_detect(IMDB$plot_keywords, keyword), "TRUE", "FALSE"))
}
for (i in 1:20) {
reference <- 27
names(IMDB_true_false)[reference + i] <- top_20$value[i]
}
truey <- data.frame()
for (keyword in top_20$value) {
truey <- rbind(truey, IMDB_true_false %>%
filter(get(keyword) == TRUE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
distinct(movie_title, .keep_all = T)
}
falsy <- data.frame()
for (keyword in top_20$value) {
falsy <- rbind(falsy, IMDB_true_false %>%
filter(get(keyword) == FALSE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
distinct(movie_title, .keep_all = T)
}
truey <- truey %>%
mutate(tri = "Top 20 Word")
falsy <- falsy %>%
mutate(tri = "NOT Top 20 Word")
truey_falsy <- full_join(falsy, truey, by = c("movie_title", "gross", "imdb_score", "movie_facebook_likes", "plot_keywords"))
truey_falsy <- truey_falsy %>%
mutate(tri = coalesce(tri.y, tri.x)) %>%
select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords, tri)
truey_falsy <- truey_falsy %>%
group_by(tri) %>%
na.omit() %>%
mutate(avg = mean(gross))
truey_falsy_graph <- summarise(truey_falsy, avg = mean(gross))
truey_falsy_graph %>%
ggplot() +
geom_bar(aes(x = tri, y=avg, fill=tri), stat="identity", position="stack") +
theme(axis.text.x = element_blank())
#Sentiment
keywords_from_split <- data.frame(lapply(keywords_split, "length<-", max(lengths(keywords_split))))
#Creating a list of titles and years separated by "|"
titles <- IMDB %>%
select(movie_title)
years <- IMDB %>%
select(title_year)
titles_years_list <- list()
for (x in seq_len(nrow(IMDB))) {
titles_years_list <- c(titles_years_list, paste(titles$movie_title[x], years$title_year[x], sep="|"))
}
#Renaming columns to movie names
movie_name_list <- IMDB$movie_title
names(keywords_from_split) <- titles_years_list
#Adding a key column to the beginning
key_names <- c('key_word_1', 'key_word_2', 'key_word_3', 'key_word_4', 'key_word_5')
keywords_from_split <- cbind(test_keys = key_names, keywords_from_split)
#Moving columns to rows
keywords_from_split <- melt(keywords_from_split, id = "test_keys")
#Creating new column for year
keywords_from_split_matrix <- do.call(rbind, strsplit(as.character(keywords_from_split$variable), "[|]"))
keywords_from_split <- cbind(keywords_from_split, keywords_from_split_matrix)
names(keywords_from_split) <- c("Key #", "Combined", "Keyword", "Title", "Year")
keywords_from_split = keywords_from_split %>%
select(`Key #`, Title, Year, Keyword)
#Replacing blank cells within keywords to NA's
keywords_from_split$Keyword[keywords_from_split$Keyword == ""] <- NA
#Removing rows containing NA's
keywords_from_split <- na.omit(keywords_from_split)
#All unique years
all_years <- keywords_from_split %>% select(Year)
all_years <- distinct(all_years)
all_years <- na.omit(all_years)
all_years <- all_years %>%
filter(Year != "NA")
#Function for sentiment per year
yearly_sentiment <- function(year, df) {
amount <- nrow(df %>%
select(Year) %>%
filter(Year == year))
df <- df %>%
filter(Year == year)
sentiments <- get_nrc_sentiment(paste(df$Keyword, collapse = " "))
for (i in seq_along(sentiments)) {
sentiments[i] <- sentiments[i]/amount
}
year_sentiment <- cbind(year, sentiments)
return (year_sentiment)
}
sentiments <- data.frame()
#For-loop to capture all years
for (i in all_years$Year) {
sentiments <- rbind(sentiments,yearly_sentiment(i, keywords_from_split))
}
#Making years integers
sentiments$year <- strtoi(sentiments$year)
#Sort by year
sentiments <- sentiments[with(sentiments, order(year)), ]
#Heatmap for sentiments
rnames <- sentiments[,1]
mat_sentiments <- data.matrix(sentiments[,2:ncol(sentiments)])
rownames(mat_sentiments) <- rnames
mat_sentiments <- t(mat_sentiments)
df_sentiment <- as.data.frame(mat_sentiments)
names_emotions <- c("anger", "anticipation", "disgust","fear","joy","sadness","surprise","trust","negative","positive")
sentiments_graph <- cbind(names_emotions, df_sentiment)
#Heatmap
heatmap.2(mat_sentiments, Rowv=NA, Colv=NA, scale="row", col=colorRampPalette(c("white","darkblue")), margins=c(5,10), trace = "none")
## Warning in heatmap.2(mat_sentiments, Rowv = NA, Colv = NA, scale = "row", :
## Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting row
## dendogram.
## Warning in heatmap.2(mat_sentiments, Rowv = NA, Colv = NA, scale = "row", :
## Discrepancy: Colv is FALSE, while dendrogram is `column'. Omitting column
## dendogram.
#Filtering sentiments graph
sentiments_graph_filter <- sentiments %>%
filter(year >= 2000)
#Other graphs
# Showing trends between sentiment from the 2000's
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = anger)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = anticipation)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = disgust)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = fear)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = joy)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = sadness)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = surprise)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = trust)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = negative)) +
geom_point(alpha = 0.5) +
geom_line()
ggplot(sentiments_graph_filter, aes(x = as.numeric(year), y = positive)) +
geom_point(alpha = 0.5) +
geom_line()